Table of Contents¶
1. Import the Dataset and Explore the Data
Check data contents, provide descriptive statistics, and check for incoherencies in the data.
Explore data visually and extract relevant insights.
Explain your rationale and findings.
Do not forget to analyze multivariate relationships.
1.1 Importing Libraries
1.2 Loading and Reading the Dataset
1.3 Descriptive Statistics
1.4 Incoherencies
1.5 Exploring Data Visually
1.6 Multivariate Relationships
2. Clean and Pre-process the Data
Are there any missing values? Take action to handle them.
Check the dataset for outliers and pre-process them. Justify your decisions.
Deal with categorical variables.
Review current features and create extra features if needed. Explain your steps.
Perform data scaling. Explain the reasoning behind your choices.
2.1 Missing Values
2.2 Duplicates
2.3 Outliers
2.4 Categorical Data
2.5 Aggregations
2.6 Feature Engineering
2.7 Data Scaling
3. Feature Selection
Define and implement an unambiguous strategy for feature selection.
Use methods discussed in the course.
Present and justify your final selection.
3.1 Filter Methods
3.1.1 Univariate Variables
3.1.2 Correlation Indices
3.1.3 Chi-Square
3.2 Wrapper Methods
3.2.1 RFE
3.3 Embedded Methods
3.3.1 Lasso
3.4 Final Insights
4. Build a Simple Model and Assess the Performance
Identify the type of the problem and select the relevant algorithms.
Use cross-validation to assess performance. Which metrics did you use and why?
Train at least 1 model using the train dataset with a macro F1-score of more than 0.6.
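The evaluation requirement above can be sketched up front: a minimal cross-validation loop computing the macro F1-score. The synthetic data and the choice of classifier here are placeholders, not the project's final pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Synthetic multiclass data standing in for the real claims dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=4, random_state=42)

# Stratified folds preserve the class distribution in each fold, which
# matters for an imbalanced target such as Claim Injury Type
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring='f1_macro')
print(scores.mean())
```

Macro F1 averages the per-class F1-scores with equal weight, so minority classes count as much as the dominant ones.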
0. Context ¶
The New York Workers’ Compensation Board (WCB) administers and regulates workers’ compensation, disability, and other workers’ benefits.
The WCB is responsible for assembling and deciding on claims whenever it becomes aware of a workplace injury. Since 2000, the WCB has assembled and reviewed more than 5 million claims. However, manually reviewing all claims is an arduous and time-consuming process. For that reason, the WCB has reached out to Nova IMS to assist in the creation of a model that can automate the decision-making whenever a new claim is received.
Our task is to create a classification model that can accurately predict the WCB’s final decision on which type of injury (Claim Injury Type) should be assigned to a claim. To do that, the WCB has provided labelled data with all claims assembled between 2020 and 2022.
1. Import the Dataset and Explore the Data ¶
1.1 Importing Libraries ¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil
import warnings
%config InlineBackend.figure_format = 'retina' # optionally, change 'retina' to 'svg'
sns.set()
warnings.filterwarnings('ignore')
1.2 Loading and Reading the Dataset ¶
train_data = pd.read_csv('train_data.csv', sep = ',', low_memory=False) # sep specifies the column delimiter
pd.set_option('display.max_columns', None) # so that all columns are displayed
train_data.head(5)
| Accident Date | Age at Injury | Alternative Dispute Resolution | Assembly Date | Attorney/Representative | Average Weekly Wage | Birth Year | C-2 Date | C-3 Date | Carrier Name | Carrier Type | Claim Identifier | Claim Injury Type | County of Injury | COVID-19 Indicator | District Name | First Hearing Date | Gender | IME-4 Count | Industry Code | Industry Code Description | Medical Fee Region | OIICS Nature of Injury Description | WCIO Cause of Injury Code | WCIO Cause of Injury Description | WCIO Nature of Injury Code | WCIO Nature of Injury Description | WCIO Part Of Body Code | WCIO Part Of Body Description | Zip Code | Agreement Reached | WCB Decision | Number of Dependents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-12-30 | 31.0 | N | 2020-01-01 | N | 0.00 | 1988.0 | 2019-12-31 | NaN | NEW HAMPSHIRE INSURANCE CO | 1A. PRIVATE | 5393875 | 2. NON-COMP | ST. LAWRENCE | N | SYRACUSE | NaN | M | NaN | 44.0 | RETAIL TRADE | I | NaN | 27.0 | FROM LIQUID OR GREASE SPILLS | 10.0 | CONTUSION | 62.0 | BUTTOCKS | 13662 | 0.0 | Not Work Related | 1.0 |
| 1 | 2019-08-30 | 46.0 | N | 2020-01-01 | Y | 1745.93 | 1973.0 | 2020-01-01 | 2020-01-14 | ZURICH AMERICAN INSURANCE CO | 1A. PRIVATE | 5393091 | 4. TEMPORARY | WYOMING | N | ROCHESTER | 2020-02-21 | F | 4.0 | 23.0 | CONSTRUCTION | I | NaN | 97.0 | REPETITIVE MOTION | 49.0 | SPRAIN OR TEAR | 38.0 | SHOULDER(S) | 14569 | 1.0 | Not Work Related | 4.0 |
| 2 | 2019-12-06 | 40.0 | N | 2020-01-01 | N | 1434.80 | 1979.0 | 2020-01-01 | NaN | INDEMNITY INSURANCE CO OF | 1A. PRIVATE | 5393889 | 4. TEMPORARY | ORANGE | N | ALBANY | NaN | M | NaN | 56.0 | ADMINISTRATIVE AND SUPPORT AND WASTE MANAGEMEN... | II | NaN | 79.0 | OBJECT BEING LIFTED OR HANDLED | 7.0 | CONCUSSION | 10.0 | MULTIPLE HEAD INJURY | 12589 | 0.0 | Not Work Related | 6.0 |
| 3 | NaN | NaN | NaN | 2020-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 957648180 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2019-12-30 | 61.0 | N | 2020-01-01 | N | NaN | 1958.0 | 2019-12-31 | NaN | STATE INSURANCE FUND | 2A. SIF | 5393887 | 2. NON-COMP | DUTCHESS | N | ALBANY | NaN | M | NaN | 62.0 | HEALTH CARE AND SOCIAL ASSISTANCE | II | NaN | 16.0 | HAND TOOL, UTENSIL; NOT POWERED | 43.0 | PUNCTURE | 36.0 | FINGER(S) | 12603 | 0.0 | Not Work Related | 1.0 |
Metadata¶
Claim Dates
Accident Date Injury date of the claim.
Assembly Date The date the claim was first assembled.
C-2 Date Date of receipt of the Employer's Report of Work-Related Injury/Illness or equivalent (formerly Form C-2).
C-3 Date Date Form C-3 (Employee Claim Form) was received.
First Hearing Date Date the first hearing was held on a claim at a WCB hearing location. A blank date means the claim has not yet had a hearing held.
Worker Demographics
Age at Injury Age of injured worker when the injury occurred.
Birth Year The reported year of birth of the injured worker.
Gender The reported gender of the injured worker.
Zip Code The reported ZIP code of the injured worker’s home address.
Claim and Case Information
Alternative Dispute Resolution Adjudication processes external to the Board.
Attorney/Representative Is the claim being represented by an Attorney?
Claim Identifier Unique identifier for each claim, assigned by the WCB.
Carrier Name Name of primary insurance provider responsible for providing workers’ compensation coverage to the injured worker’s employer.
Carrier Type Type of primary insurance provider responsible for providing workers’ compensation coverage.
Average Weekly Wage The wage used to calculate workers’ compensation, disability, or paid leave wage replacement benefits.
Location and Region
County of Injury Name of the New York County where the injury occurred.
District Name Name of the WCB district office that oversees claims for that region or area of the state.
Medical Fee Region Approximate region where the injured worker would receive medical service.
Incident and Injury Details
COVID-19 Indicator Indication that the claim may be associated with COVID-19.
IME-4 Count Number of IME-4 forms received per claim. The IME-4 form is the “Independent Examiner's Report of Independent Medical Examination” form.
Industry Classification
Industry Code NAICS code and descriptions are available at https://www.naics.com/search-naics-codes-by-industry/.
Industry Code Description 2-digit NAICS industry code description used to classify businesses according to their economic activity.
Injury Descriptions and Codes
OIICS Nature of Injury Description The OIICS nature of injury codes & descriptions are available at https://www.bls.gov/iif/oiics_manual_2007.pdf.
WCIO Cause of Injury Code The WCIO cause of injury codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Cause_Table.pdf.
WCIO Cause of Injury Description See description of field above.
WCIO Nature of Injury Code The WCIO nature of injury codes are available at https://www.wcio.org/Active%20PNC/WCIO_Nature_Table.pdf.
WCIO Nature of Injury Description See description of field above.
WCIO Part Of Body Code The WCIO part of body codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Part_Table.pdf.
WCIO Part Of Body Description See description of field above.
Claim Outcomes
Agreement Reached Binary variable: Yes if there is an agreement without the involvement of the WCB; otherwise unknown at the start of a claim.
WCB Decision Multiclass variable: Decision of the WCB relative to the claim; "Accident" indicates a workplace accident, and "Occupational Disease" indicates illness from the workplace, both of which require WCB deliberation and may be unknown at the claim's start.
Claim Injury Type Main target variable: Deliberation of the WCB relative to benefits awarded to the claim, with numbering indicating severity.
1.3 Descriptive Statistics ¶
Shape¶
train_data.shape
(593471, 33)
Change in Datatypes:¶
float -> integer
Age at Injury
Birth Year
IME-4 Count
Number of Dependents
Agreement Reached
float -> object
OIICS Nature of Injury Description
object -> datetime
C-2 Date
C-3 Date
First Hearing Date
Accident Date
Assembly Date
Change in datatypes¶
Data conversion to integer:
train_data_to_int = ['Age at Injury', 'Birth Year', 'IME-4 Count', 'Number of Dependents', 'Agreement Reached']
for col in train_data_to_int:
# Convert to numeric and handle NaNs by keeping them as NaN, then cast to Int64
train_data[col] = pd.to_numeric(train_data[col], errors='coerce').astype("Int64")
Data conversion to object:
train_data_to_obj = ['OIICS Nature of Injury Description']
for col in train_data_to_obj:
train_data[col] = train_data[col].astype('object')
train_data_float_to_int = ['WCIO Cause of Injury Code','WCIO Nature of Injury Code','WCIO Part Of Body Code']
for col in train_data_float_to_int:
# Convert the column to numeric, then to Int64 (nullable integer type)
train_data[col] = pd.to_numeric(train_data[col], errors='coerce').astype('Int64')
Categorical Data from number to object/category:
train_data_num_to_object= ['Industry Code', 'Agreement Reached', 'WCIO Cause of Injury Code',
'WCIO Nature of Injury Code', 'WCIO Part Of Body Code','Claim Identifier']
for col in train_data_num_to_object:
# Convert the column to an object while keeping NaNs
train_data[col] = train_data[col].astype('object')
Dates from object to datetime format:
# Convert dates treated as objects to datetime format
train_data['C-2 Date'] = pd.to_datetime(train_data['C-2 Date'], errors='coerce')
train_data['C-3 Date'] = pd.to_datetime(train_data['C-3 Date'], errors='coerce')
train_data['Accident Date'] = pd.to_datetime(train_data['Accident Date'], errors='coerce')
train_data['First Hearing Date'] = pd.to_datetime(train_data['First Hearing Date'], errors='coerce')
train_data['Assembly Date'] = pd.to_datetime(train_data['Assembly Date'], errors='coerce')
train_data.dtypes
Accident Date                         datetime64[ns]
Age at Injury                                  Int64
Alternative Dispute Resolution                object
Assembly Date                         datetime64[ns]
Attorney/Representative                       object
Average Weekly Wage                          float64
Birth Year                                     Int64
C-2 Date                              datetime64[ns]
C-3 Date                              datetime64[ns]
Carrier Name                                  object
Carrier Type                                  object
Claim Identifier                              object
Claim Injury Type                             object
County of Injury                              object
COVID-19 Indicator                            object
District Name                                 object
First Hearing Date                    datetime64[ns]
Gender                                        object
IME-4 Count                                    Int64
Industry Code                                 object
Industry Code Description                     object
Medical Fee Region                            object
OIICS Nature of Injury Description            object
WCIO Cause of Injury Code                     object
WCIO Cause of Injury Description              object
WCIO Nature of Injury Code                    object
WCIO Nature of Injury Description             object
WCIO Part Of Body Code                        object
WCIO Part Of Body Description                 object
Zip Code                                      object
Agreement Reached                             object
WCB Decision                                  object
Number of Dependents                           Int64
dtype: object
train_data.dtypes.astype(str).value_counts().sort_values(ascending=False)
object            23
datetime64[ns]     5
Int64              4
float64            1
Name: count, dtype: int64
1.4 Incoherencies ¶
Column with unexpected negative values¶
train_data[train_data['WCIO Part Of Body Code'] < 0]
| Accident Date | Age at Injury | Alternative Dispute Resolution | Assembly Date | Attorney/Representative | Average Weekly Wage | Birth Year | C-2 Date | C-3 Date | Carrier Name | Carrier Type | Claim Identifier | Claim Injury Type | County of Injury | COVID-19 Indicator | District Name | First Hearing Date | Gender | IME-4 Count | Industry Code | Industry Code Description | Medical Fee Region | OIICS Nature of Injury Description | WCIO Cause of Injury Code | WCIO Cause of Injury Description | WCIO Nature of Injury Code | WCIO Nature of Injury Description | WCIO Part Of Body Code | WCIO Part Of Body Description | Zip Code | Agreement Reached | WCB Decision | Number of Dependents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 2019-12-19 | 42 | N | 2020-01-01 | Y | 1093.74 | 1977 | 2019-12-31 | 2020-01-14 | PUBLIC EMPLOYERS RISK MGMT. | 3A. SELF PUBLIC | 5393974 | 4. TEMPORARY | DUTCHESS | N | ALBANY | 2020-08-27 | M | 6 | 92.0 | PUBLIC ADMINISTRATION | II | NaN | 99 | OTHER - MISCELLANEOUS, NOC | 52 | STRAIN OR TEAR | -9 | MULTIPLE | 12578 | 1 | Not Work Related | 6 |
| 235 | 2019-09-13 | 36 | N | 2020-01-02 | Y | 1500.00 | 1983 | 2020-02-07 | 2019-12-27 | POLICE, FIRE, SANITATION | 3A. SELF PUBLIC | 5393785 | 5. PPD SCH LOSS | SUFFOLK | N | HAUPPAUGE | 2020-03-12 | F | 3 | 92.0 | PUBLIC ADMINISTRATION | IV | NaN | 81 | STRUCK OR INJURED, NOC | 10 | CONTUSION | -9 | MULTIPLE | 11704 | 0 | Not Work Related | 5 |
| 790 | 2019-12-12 | 57 | N | 2020-01-03 | Y | 764.35 | 1962 | 2020-01-03 | 2020-01-09 | PROPERTY AND CASUALTY | 1A. PRIVATE | 5395679 | 5. PPD SCH LOSS | BRONX | N | NYC | 2020-02-18 | M | <NA> | 31.0 | MANUFACTURING | IV | NaN | 29 | ON SAME LEVEL | 52 | STRAIN OR TEAR | -9 | MULTIPLE | 10466 | 0 | Not Work Related | 4 |
| 1006 | 2018-01-06 | 50 | N | 2020-01-03 | N | 0.00 | 1967 | 2020-01-03 | NaT | NEW HAMPSHIRE INSURANCE CO | 1A. PRIVATE | 5395006 | 2. NON-COMP | BRONX | N | NYC | NaT | M | <NA> | 23.0 | CONSTRUCTION | UK | NaN | 97 | REPETITIVE MOTION | 59 | ALL OTHER SPECIFIC INJURIES, NOC | -9 | MULTIPLE | 07726 | 0 | Not Work Related | 0 |
| 2276 | 2019-12-26 | 33 | N | 2020-01-06 | Y | 670.00 | <NA> | 2020-01-06 | 2020-12-08 | SYRACUSE, CITY OF | 3A. SELF PUBLIC | 5396338 | 4. TEMPORARY | ONONDAGA | N | SYRACUSE | 2021-02-16 | M | 9 | 92.0 | PUBLIC ADMINISTRATION | II | NaN | 56 | LIFTING | 52 | STRAIN OR TEAR | -9 | MULTIPLE | 13208 | 0 | Not Work Related | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 593433 | 2022-11-23 | 30 | N | 2022-12-30 | Y | 0.00 | 1992 | 2023-01-06 | 2022-12-21 | WESCO INSURANCE COMPANY | 1A. PRIVATE | 6164364 | 2. NON-COMP | QUEENS | N | NYC | NaT | F | <NA> | 44.0 | RETAIL TRADE | IV | NaN | 26 | FROM LADDER OR SCAFFOLDING | 52 | STRAIN OR TEAR | -9 | MULTIPLE | NaN | 0 | Not Work Related | 5 |
| 593438 | 2022-05-03 | 52 | N | 2022-12-30 | N | 1555.03 | 1970 | 2022-12-29 | NaT | ONONDAGA COUNTY SELF INS DIV | 3A. SELF PUBLIC | 6165182 | 4. TEMPORARY | ONONDAGA | N | SYRACUSE | NaT | F | 2 | 92.0 | PUBLIC ADMINISTRATION | II | NaN | 97 | REPETITIVE MOTION | 78 | CARPAL TUNNEL SYNDROME | -9 | MULTIPLE | 13215 | 0 | Not Work Related | 3 |
| 593439 | 2022-12-26 | 30 | N | 2022-12-30 | Y | 1234.19 | 1992 | 2022-12-30 | NaT | TONAWANDA, TOWN OF | 3A. SELF PUBLIC | 6165293 | 4. TEMPORARY | ERIE | N | BUFFALO | 2024-05-21 | M | 1 | 92.0 | PUBLIC ADMINISTRATION | I | NaN | 79 | OBJECT BEING LIFTED OR HANDLED | 90 | MULTIPLE PHYSICAL INJURIES ONLY | -9 | MULTIPLE | 14150 | 0 | Not Work Related | 3 |
| 593442 | 2022-12-22 | 53 | N | 2022-12-30 | Y | 0.00 | 1969 | 2022-12-30 | NaT | INDEMNITY INS. OF N AMERICA | 1A. PRIVATE | 6165175 | 2. NON-COMP | NASSAU | N | NYC | NaT | F | 1 | 72.0 | ACCOMMODATION AND FOOD SERVICES | IV | NaN | 29 | ON SAME LEVEL | 52 | STRAIN OR TEAR | -9 | MULTIPLE | 11570 | 0 | Not Work Related | 1 |
| 593467 | 2022-12-13 | 72 | N | 2022-12-31 | N | 0.00 | 1950 | 2022-12-31 | NaT | TECHNOLOGY INSURANCE CO. INC. | 1A. PRIVATE | 6165075 | 2. NON-COMP | SULLIVAN | N | BINGHAMTON | NaT | F | <NA> | 48.0 | TRANSPORTATION AND WAREHOUSING | I | NaN | 25 | FROM DIFFERENT LEVEL (ELEVATION) | 90 | MULTIPLE PHYSICAL INJURIES ONLY | -9 | MULTIPLE | 12779 | 0 | Not Work Related | 3 |
42011 rows × 33 columns
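The -9 code above is not a data-entry error: in the WCIO part-of-body table it encodes 'MULTIPLE'. A minimal sketch of one possible treatment, remapping the sentinel onto a non-negative code on a toy frame (the target code 90 is an assumption for illustration, not confirmed against the WCIO table):

```python
import pandas as pd

# Toy frame mimicking the two columns involved (values are illustrative)
df = pd.DataFrame({
    'WCIO Part Of Body Code': pd.array([62, -9, 38, -9], dtype='Int64'),
    'WCIO Part Of Body Description': ['BUTTOCKS', 'MULTIPLE', 'SHOULDER(S)', 'MULTIPLE'],
})

# -9 is a valid category ('MULTIPLE'), not a missing value; if a strictly
# non-negative code is preferred, remap it (90 is an assumed stand-in)
df['WCIO Part Of Body Code'] = df['WCIO Part Of Body Code'].replace(-9, 90)
print(df['WCIO Part Of Body Code'].min())  # no negative codes remain
```

Whether to remap or keep -9 as its own category is a preprocessing judgment call; either way it should not be treated as missing data.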
Columns that should have matching number of unique values:¶
WCIO Part Of Body Code <> WCIO Part Of Body Description
Industry Code <> Industry Code Description
WCIO Cause of Injury Code <> WCIO Cause of Injury Description
Let's analyse what happened:
WCIO Part Of Body Code <> WCIO Part Of Body Description
train_data[['WCIO Part Of Body Code', 'WCIO Part Of Body Description']].nunique()
#there are 3 more codes than descriptions
WCIO Part Of Body Code           57
WCIO Part Of Body Description    54
dtype: int64
# Drop duplicates by code, keeping only the first occurrence of each code
body_uniques = train_data.drop_duplicates(subset=['WCIO Part Of Body Code'])[['WCIO Part Of Body Code', 'WCIO Part Of Body Description']]
# Count the frequency of each description among the unique codes
body_repeats = body_uniques['WCIO Part Of Body Description'].value_counts()
# Keep only the descriptions that appear more than once
body_repeats[body_repeats > 1]
WCIO Part Of Body Description
DISC           2
SOFT TISSUE    2
SPINAL CORD    2
Name: count, dtype: int64
body_uniques[body_uniques['WCIO Part Of Body Description'].isin(['DISC', 'SOFT TISSUE', 'SPINAL CORD'])][['WCIO Part Of Body Code', 'WCIO Part Of Body Description']]
# SPINAL CORD: codes 23 and 47
# SOFT TISSUE: codes 18 and 25
# DISC: codes 43 and 22
| WCIO Part Of Body Code | WCIO Part Of Body Description | |
|---|---|---|
| 29 | 23 | SPINAL CORD |
| 41 | 18 | SOFT TISSUE |
| 45 | 25 | SOFT TISSUE |
| 113 | 43 | DISC |
| 151 | 22 | DISC |
| 2761 | 47 | SPINAL CORD |
Industry Code <> Industry Code Description
train_data[['Industry Code', 'Industry Code Description']].nunique()
Industry Code                24
Industry Code Description    20
dtype: int64
# Drop duplicates by code, keeping only the first occurrence of each code
industry_uniques = train_data.drop_duplicates(subset=['Industry Code'])[['Industry Code', 'Industry Code Description']]
# Count the frequency of each description among the unique codes
industry_repeats = industry_uniques['Industry Code Description'].value_counts()
# Keep only the descriptions that appear more than once
industry_repeats[industry_repeats > 1]
Industry Code Description
MANUFACTURING                     3
RETAIL TRADE                      2
TRANSPORTATION AND WAREHOUSING    2
Name: count, dtype: int64
industry_uniques[industry_uniques['Industry Code Description'].isin(['MANUFACTURING', 'RETAIL TRADE', 'TRANSPORTATION AND WAREHOUSING'])][['Industry Code', 'Industry Code Description']]
# RETAIL TRADE: codes 44 and 45
# MANUFACTURING: codes 31, 32 and 33
# TRANSPORTATION AND WAREHOUSING: codes 48 and 49
| Industry Code | Industry Code Description | |
|---|---|---|
| 0 | 44.0 | RETAIL TRADE |
| 16 | 31.0 | MANUFACTURING |
| 17 | 48.0 | TRANSPORTATION AND WAREHOUSING |
| 18 | 45.0 | RETAIL TRADE |
| 31 | 49.0 | TRANSPORTATION AND WAREHOUSING |
| 65 | 33.0 | MANUFACTURING |
| 149 | 32.0 | MANUFACTURING |
WCIO Cause of Injury Code <> WCIO Cause of Injury Description
train_data[['WCIO Cause of Injury Code','WCIO Cause of Injury Description']].nunique()
WCIO Cause of Injury Code           77
WCIO Cause of Injury Description    74
dtype: int64
# Drop duplicates by code, keeping only the first occurrence of each code
cinjury_uniques = train_data.drop_duplicates(subset=['WCIO Cause of Injury Code'])[['WCIO Cause of Injury Code','WCIO Cause of Injury Description']]
# Count the frequency of each description among the unique codes
cinjury_repeats = cinjury_uniques['WCIO Cause of Injury Description'].value_counts()
# Keep only the descriptions that appear more than once
cinjury_repeats[cinjury_repeats > 1]
WCIO Cause of Injury Description
OBJECT BEING LIFTED OR HANDLED    3
REPETITIVE MOTION                 2
Name: count, dtype: int64
cinjury_uniques[cinjury_uniques['WCIO Cause of Injury Description'].isin(['OBJECT BEING LIFTED OR HANDLED', 'REPETITIVE MOTION'])][['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]
# 97: REPETITIVE MOTION (duplicate code: 94)
# 79: OBJECT BEING LIFTED OR HANDLED (duplicate codes: 17 and 66)
| WCIO Cause of Injury Code | WCIO Cause of Injury Description | |
|---|---|---|
| 1 | 97 | REPETITIVE MOTION |
| 2 | 79 | OBJECT BEING LIFTED OR HANDLED |
| 90 | 17 | OBJECT BEING LIFTED OR HANDLED |
| 141 | 66 | OBJECT BEING LIFTED OR HANDLED |
| 2953 | 94 | REPETITIVE MOTION |
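The remapping noted in the comments above (94 → 97; 17 and 66 → 79) can be applied with a dictionary replace. A minimal sketch on an illustrative column; the canonical codes follow the comments and should be double-checked against the WCIO table:

```python
import pandas as pd

# Illustrative column containing the duplicate cause-of-injury codes found above
df = pd.DataFrame({'WCIO Cause of Injury Code': pd.array([97, 94, 79, 17, 66], dtype='Int64')})

# Collapse duplicate codes onto one canonical code per description
cause_code_map = {94: 97, 17: 79, 66: 79}
df['WCIO Cause of Injury Code'] = df['WCIO Cause of Injury Code'].replace(cause_code_map)
print(df['WCIO Cause of Injury Code'].nunique())  # → 2
```

The same pattern applies to the part-of-body and industry code pairs, once a canonical code is chosen for each repeated description.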
Columns with both numbers and string as values¶
Carrier Type
train_data['Carrier Type'].unique()
# 1A. PRIVATE
# 2A. SIF (State Insurance Fund)
# 3A. SELF PUBLIC
# 4A. SELF PRIVATE
# 5A. SPECIAL FUND - CONS. COMM. (SECT. 25-A)
# 5C. SPECIAL FUND - POI CARRIER WCB MENANDS
# 5D. SPECIAL FUND - UNKNOWN
# NaN
# UNKNOWN
array(['1A. PRIVATE', nan, '2A. SIF', '4A. SELF PRIVATE',
'3A. SELF PUBLIC', 'UNKNOWN', '5D. SPECIAL FUND - UNKNOWN',
'5A. SPECIAL FUND - CONS. COMM. (SECT. 25-A)',
'5C. SPECIAL FUND - POI CARRIER WCB MENANDS'], dtype=object)
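If the three '5*. SPECIAL FUND' variants above turn out to be too sparse to model separately, one option is to collapse them into a single category. A hedged sketch on a toy Series; whether to group them is a modelling judgment call, not something the data dictates:

```python
import pandas as pd

s = pd.Series(['1A. PRIVATE', '5D. SPECIAL FUND - UNKNOWN',
               '5A. SPECIAL FUND - CONS. COMM. (SECT. 25-A)',
               '5C. SPECIAL FUND - POI CARRIER WCB MENANDS', 'UNKNOWN'])

# Replace every special-fund variant (codes starting with '5') with one label
grouped = s.where(~s.str.startswith('5'), 'SPECIAL FUND')
print(grouped.unique())
```

Grouping rare levels like this reduces cardinality before encoding, at the cost of losing the distinction between the fund types.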
Claim Injury Type
train_data['Claim Injury Type'].unique()
#1. CANCELLED
#2. NON-COMP
#3. MED ONLY
#4. TEMPORARY
#5. PPD SCH LOSS
#6. PPD NSL
#7. PTD
#8. DEATH
array(['2. NON-COMP', '4. TEMPORARY', nan, '3. MED ONLY',
'5. PPD SCH LOSS', '6. PPD NSL', '1. CANCELLED', '8. DEATH',
'7. PTD'], dtype=object)
1.5 Exploring Data Visually ¶
Numerical Data Visualization¶
For numerical data, we plotted a histogram showing each variable's distribution:
note that this visualization is not ideal for the heavily skewed Average Weekly Wage.
# Select numerical attributes with continuous values
numeric_attributes = train_data.loc[:, ['Age at Injury','Average Weekly Wage', 'Birth Year','IME-4 Count','Number of Dependents']]
# Set the number of subplots
num_attributes = len(numeric_attributes.columns)
num_cols = 3 # Set the number of columns for the subplots
num_rows = (num_attributes // num_cols) + (num_attributes % num_cols > 0) # Calculate rows needed
# Create subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 4))
axes = axes.flatten() # Flatten the 2D array of axes
# Loop through each numerical attribute and plot its distribution
for i, column in enumerate(numeric_attributes.columns):
sns.histplot(numeric_attributes[column], bins=30, kde=True, ax=axes[i]) # KDE adds a density curve
axes[i].set_title(f'Distribution of {column}')
axes[i].set_xlabel(column)
axes[i].set_ylabel('Frequency')
# Hide any unused subplots
for j in range(i + 1, len(axes)):
axes[j].axis('off')
plt.tight_layout()
plt.show()
We can conclude that:¶
- only 'Age at Injury' appears to follow an approximately normal distribution
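As noted above, the raw histogram works poorly for the heavily right-skewed Average Weekly Wage; a log1p transform usually makes such a distribution readable. A sketch on synthetic lognormal wages (the real column would be substituted for `wages`):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
wages = rng.lognormal(mean=6.5, sigma=1.0, size=10_000)  # synthetic, right-skewed

# log1p maps 0 to 0, so it also handles the many zero wages in the real data
log_wages = np.log1p(wages)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(wages, bins=30)
ax1.set_title('Average Weekly Wage (raw)')
ax2.hist(log_wages, bins=30)
ax2.set_title('Average Weekly Wage (log1p)')
plt.tight_layout()
```

The transformed panel spreads the mass of low wages across the axis instead of piling everything into the first bin.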
categorical_columns = train_data.loc[:, ['Alternative Dispute Resolution', 'Attorney/Representative', 'Carrier Type', 'Claim Injury Type', 'County of Injury',
'COVID-19 Indicator', 'District Name', 'Gender',
'Industry Code Description', 'Medical Fee Region',
'OIICS Nature of Injury Description',
'WCIO Cause of Injury Description', 'WCIO Nature of Injury Description',
'WCIO Part Of Body Description', 'Agreement Reached',
'WCB Decision']]
# Filter only categorical columns with non-empty data
valid_categorical_columns = [
col for col in categorical_columns
if train_data[col].notna().any() # Check if the column has at least one non-NaN value
]
# Plot value counts for each valid categorical column
for col in valid_categorical_columns:
plt.figure(figsize=(16, 8))
value_counts = train_data[col].value_counts()
if not value_counts.empty: # Check if there are values to plot
value_counts.plot(kind="bar")
plt.title(f'Value Counts of {col}')
plt.xlabel(col)
plt.ylabel('Count')
plt.show()
else:
print(f"Skipping column '{col}' as it contains no valid data.")
# Define age categories
bins = [0, 15, 18, 31, 42, 54, 75, 117]
labels = ['0-14', '15-17', '18-30', '31-41', '42-53', '54-74', '75+']  # right=False makes each bin left-closed, e.g. [54, 75)
# Create a new column for age categories
train_data['Age Category'] = pd.cut(train_data['Age at Injury'], bins=bins, labels=labels, right=False)
age_distribution = train_data['Age Category'].value_counts()
# Plot the pie chart
plt.figure(figsize=(14, 10)) # Set the figure size
age_distribution.plot(kind='pie', autopct='%1.1f%%', colors=['skyblue', 'lightgreen', 'coral', 'orange', 'purple', 'yellow', 'pink'])
# Customize the plot
plt.title('Age at Injury')
plt.ylabel('') # Remove the default y-label (which shows "Age Category")
# Show the plot
plt.show()
# numeric_attributes is already a DataFrame, so call .corr() on it directly
corr = numeric_attributes.corr(method="pearson").round(2)
mask_annot = np.absolute(corr.values) >= 0.45
annot = np.where(mask_annot, corr.values, np.full(corr.shape,""))
fig = plt.figure(figsize=(20, 15))
sns.heatmap(data=corr,
annot=annot,
fmt='s',
vmin=-1, vmax=1, center=0,
square=True, linewidths=.5,
cmap='PiYG')
plt.show()
box_plot_features = ['Age at Injury', 'Average Weekly Wage', 'Claim Injury Type', 'First Hearing Date', 'Accident Date', 'Assembly Date']
sns.set()
fig, axes = plt.subplots(2, ceil(len(box_plot_features) / 2), figsize=(40, 11))
for ax, feat in zip(axes.flatten(), box_plot_features):
sns.boxplot(x=train_data[feat], ax=ax)
title = "Numeric Variables' Box Plots"
plt.suptitle(title)
plt.show()
# Figure size and style
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")
# Box plot of age within each claim injury type
sns.boxplot(x='Claim Injury Type', y='Age at Injury', data=train_data)
# Labels and title
plt.xlabel('Claim Injury Type')
plt.ylabel('Age at Injury')
plt.title('Distribution of Age by Claim Injury Type')
plt.xticks(rotation=45)  # rotate the labels for readability
# Display the plot
plt.show()
# Style configuration
sns.set(style="whitegrid")
# Scatter plot: Average Weekly Wage x Claim Injury Type
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Average Weekly Wage', y='Claim Injury Type', data=train_data, alpha=0.5)
plt.xlabel('Average Weekly Wage')
plt.ylabel('Claim Injury Type')
plt.title('Scatter Plot: Claim Injury Type vs Average Weekly Wage')
plt.xticks(rotation=45)  # rotate the labels for readability
plt.show()
# Box plot: Average Weekly Wage x Claim Injury Type
plt.figure(figsize=(12, 6))
sns.boxplot(x='Average Weekly Wage', y='Claim Injury Type', data=train_data)
plt.xlabel('Average Weekly Wage')
plt.ylabel('Claim Injury Type')
plt.title('Box Plot: Average Weekly Wage vs Claim Injury Type')
plt.xticks(rotation=45)  # rotate the labels for readability
plt.show()
# Figure size and style configuration
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
# Bar chart counting occurrences of each injury type per gender
sns.countplot(x='Claim Injury Type', hue='Gender', data=train_data)
# Labels and title
plt.xlabel('Claim Injury Type')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by Gender')
plt.xticks(rotation=45)  # Rotate labels for readability
# Display the plot
plt.show()
# Compute the proportions of Claim Injury Type per Gender
proporcoes = train_data.groupby(['Claim Injury Type', 'Gender']).size().reset_index(name='Count')
proporcoes['Proporção'] = proporcoes.groupby('Claim Injury Type')['Count'].transform(lambda x: x / x.sum())
# Figure size and style configuration
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
# Bar chart of the proportions
sns.barplot(x='Claim Injury Type', y='Proporção', hue='Gender', data=proporcoes)
# Labels and title
plt.xlabel('Claim Injury Type')
plt.ylabel('Proportion')
plt.title('Proportion of Claim Injury Type by Gender')
plt.xticks(rotation=45)  # Rotate labels for readability
# Display the plot
plt.show()
cat1 = 'Claim Injury Type'
cat2 = 'Gender'
fig, axes = plt.subplots(1,2, figsize=(12,4))
catpc_df = train_data.groupby([cat1, cat2])[cat2].size().unstack()
catpc_df.plot.bar(stacked=True, ax=axes[0])
axes[0].set_title('Claim Injury Type vs Gender, Absolute counts')
axes[0].legend([], frameon=False)
catpc_df2 = train_data.groupby([cat1, cat2])[cat2].size() / train_data.groupby([cat1])[cat2].size()
catpc_df2.unstack().plot.bar(stacked=True, ax=axes[1])
axes[1].set_title('Claim Injury Type vs Gender, Relative counts')
axes[1].legend(loc=(1.02,0))
plt.show()
# Count occurrences of Claim Injury Type per Industry Code
heatmap_data = train_data.groupby(['Industry Code Description', 'Claim Injury Type']).size().unstack(fill_value=0)
# Figure size and style configuration
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")
# Create the heatmap
sns.heatmap(heatmap_data, annot=True, fmt='g', cmap='Blues', cbar_kws={'label': 'Count'})
# Labels and title
plt.xlabel('Claim Injury Type')
plt.ylabel('Industry Code Description')
plt.title('Heatmap: Claim Injury Type by Industry Code Description')
# Display the plot
plt.show()
# Count occurrences of Claim Injury Type per Medical Fee Region
contagem = train_data.groupby(['Medical Fee Region', 'Claim Injury Type']).size().reset_index(name='Count')
# Total occurrences per Medical Fee Region
soma_por_regiao = contagem.groupby('Medical Fee Region')['Count'].sum().reset_index()
# Sort the regions by total count
soma_por_regiao = soma_por_regiao.sort_values(by='Count', ascending=False)
# Apply the sorted region order as a categorical
contagem['Medical Fee Region'] = pd.Categorical(contagem['Medical Fee Region'], categories=soma_por_regiao['Medical Fee Region'], ordered=True)
# Total occurrences per Claim Injury Type
soma_por_tipo = contagem.groupby('Claim Injury Type')['Count'].sum().reset_index()
# Sort the injury types by total count
soma_por_tipo = soma_por_tipo.sort_values(by='Count', ascending=False)
# Apply the sorted injury-type order as a categorical
contagem['Claim Injury Type'] = pd.Categorical(contagem['Claim Injury Type'], categories=soma_por_tipo['Claim Injury Type'], ordered=True)
# Figure size and style configuration
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")
# Grouped bar chart
sns.barplot(x='Medical Fee Region', y='Count', hue='Claim Injury Type', data=contagem)
# Labels and title
plt.xlabel('Medical Fee Region')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by Medical Fee Region (Sorted)')
plt.xticks(rotation=45)  # Rotate labels for readability
# Display the plot
plt.legend(title='Claim Injury Type')
plt.show()
cat1 = 'Claim Injury Type'
cat2 = 'Medical Fee Region'
fig, axes = plt.subplots(1,2, figsize=(12,4))
catpc_df = train_data.groupby([cat1, cat2])[cat2].size().unstack()
catpc_df.plot.bar(stacked=True, ax=axes[0])
axes[0].set_title('Claim Injury Type vs Medical Fee Region, Absolute counts')
axes[0].legend([], frameon=False)
catpc_df2 = train_data.groupby([cat1, cat2])[cat2].size() / train_data.groupby([cat1])[cat2].size()
catpc_df2.unstack().plot.bar(stacked=True, ax=axes[1])
axes[1].set_title('Claim Injury Type vs Medical Fee Region, Relative counts')
axes[1].legend(loc=(1.02,0))
plt.show()
# Figure size and style configuration
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")
# Count occurrences of Claim Injury Type per WCIO Cause of Injury Description
contagem = train_data.groupby(['WCIO Cause of Injury Description', 'Claim Injury Type']).size().reset_index(name='Count')
# Total occurrences per WCIO Cause of Injury Description
soma_por_causa = contagem.groupby('WCIO Cause of Injury Description')['Count'].sum().reset_index()
# Select the 10 descriptions with the most values
top10_causas = soma_por_causa.sort_values(by='Count', ascending=False).head(10)
# Keep only the rows belonging to those top 10 descriptions
contagem_top10 = contagem[contagem['WCIO Cause of Injury Description'].isin(top10_causas['WCIO Cause of Injury Description'])]
# Strip plot
sns.stripplot(x='WCIO Cause of Injury Description', y='Count', hue='Claim Injury Type', data=contagem_top10, dodge=True, jitter=True)
# Labels and title
plt.xlabel('WCIO Cause of Injury Description (Top 10)')
plt.ylabel('Count')
plt.title('Strip Plot: Claim Injury Type by WCIO Cause of Injury Description (Top 10)')
plt.xticks(rotation=90)  # Rotate labels for readability
# Display the plot
plt.legend(title='Claim Injury Type')
plt.show()
# Count occurrences of Claim Injury Type per WCIO Cause of Injury Description
contagem = train_data.groupby(['WCIO Cause of Injury Description', 'Claim Injury Type']).size().reset_index(name='Count')
# Total occurrences per WCIO Cause of Injury Description
soma_por_causa = contagem.groupby('WCIO Cause of Injury Description')['Count'].sum().reset_index()
# Select the 10 descriptions with the most values
top10_causas = soma_por_causa.sort_values(by='Count', ascending=False).head(10)
# Keep only the rows belonging to those top 10 descriptions
contagem_top10 = contagem[contagem['WCIO Cause of Injury Description'].isin(top10_causas['WCIO Cause of Injury Description'])]
# Sort by cause of injury and claim injury type
contagem_top10 = contagem_top10.sort_values(by=['WCIO Cause of Injury Description', 'Claim Injury Type'])
# Figure size and style configuration
plt.figure(figsize=(12, 6))
sns.set(style="whitegrid")
# Grouped bar chart
sns.barplot(x='WCIO Cause of Injury Description', y='Count', hue='Claim Injury Type', data=contagem_top10,
            order=top10_causas['WCIO Cause of Injury Description'].tolist())
# Labels and title
plt.xlabel('WCIO Cause of Injury Description (Top 10)')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by WCIO Cause of Injury Description (Top 10)')
plt.xticks(rotation=45)  # Rotate labels for readability
# Display the plot
plt.legend(title='Claim Injury Type')
plt.show()
# Count occurrences of Claim Injury Type per WCIO Part Of Body Description
contagem_part_body = train_data.groupby(['WCIO Part Of Body Description', 'Claim Injury Type']).size().unstack(fill_value=0)
# Create the heat map
plt.figure(figsize=(12, 15))
sns.heatmap(contagem_part_body, annot=True, fmt='d', cmap='YlGnBu', cbar=True)
# Labels and title
plt.xlabel('Claim Injury Type')
plt.ylabel('WCIO Part Of Body Description')
plt.title('Heat Map: WCIO Part Of Body Description vs Claim Injury Type')
plt.xticks(rotation=45)  # Rotate labels for readability
plt.yticks(rotation=0)  # Keep labels horizontal
# Display the plot
plt.show()
# Count occurrences of Claim Injury Type per COVID-19 Indicator
contagem_covid = train_data.groupby(['COVID-19 Indicator', 'Claim Injury Type']).size().reset_index(name='Count')
# Figure size and style configuration
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
# Bar chart
sns.barplot(x='COVID-19 Indicator', y='Count', hue='Claim Injury Type', data=contagem_covid)
# Labels and title
plt.xlabel('COVID-19 Indicator')
plt.ylabel('Count')
plt.title('Distribution of Claim Injury Type by COVID-19 Indicator')
plt.xticks(rotation=45)  # Rotate labels for readability
# Display the plot
plt.legend(title='Claim Injury Type')
plt.show()
cat1 = 'COVID-19 Indicator'
cat2 = 'Claim Injury Type'
fig, axes = plt.subplots(1,2, figsize=(12,4))
catpc_df = train_data.groupby([cat1, cat2])[cat2].size().unstack()
catpc_df.plot.bar(stacked=True, ax=axes[0])
axes[0].set_title('Claim Injury Type vs COVID-19 Indicator, Absolute counts')
axes[0].legend([], frameon=False)
catpc_df2 = train_data.groupby([cat1, cat2])[cat2].size() / train_data.groupby([cat1])[cat2].size()
catpc_df2.unstack().plot.bar(stacked=True, ax=axes[1])
axes[1].set_title('Claim Injury Type vs COVID-19 Indicator, Relative Counts')
axes[1].legend(loc=(1.02,0))
plt.show()
# Calculate the date differences
train_data['Hearing_to_Accident'] = (train_data['First Hearing Date'] - train_data['Accident Date']).dt.days
train_data['Assembly_to_Accident'] = (train_data['Assembly Date'] - train_data['Accident Date']).dt.days
train_data['Hearing_to_Assembly'] = (train_data['First Hearing Date'] - train_data['Assembly Date']).dt.days
# Set up the scatter plot
plt.figure(figsize=(12, 8))
# Scatter plot of 'Hearing_to_Accident' vs 'Assembly_to_Accident' colored by 'Claim Injury Type'
sns.scatterplot(
data=train_data,
x='Hearing_to_Accident',
y='Assembly_to_Accident',
hue='Claim Injury Type',
palette='viridis',
alpha=0.7
)
# Adding titles and labels
plt.title('Scatter Plot of Date Differences by Claim Injury Type')
plt.xlabel('Days from Accident to First Hearing')
plt.ylabel('Days from Accident to Assembly Date')
plt.legend(title='Claim Injury Type')
plt.grid(True)
plt.show()
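Before reading too much into these date-difference scatter plots, it is worth checking for negative gaps, which would indicate incoherent dates (an accident recorded after its own assembly or hearing). A minimal sketch on a hypothetical two-row sample:

```python
import pandas as pd

# Hypothetical mini-sample: the second row has an accident date
# after the assembly date, which is logically impossible
df = pd.DataFrame({
    "Accident Date": pd.to_datetime(["2021-01-10", "2021-05-01"]),
    "Assembly Date": pd.to_datetime(["2021-01-20", "2021-04-15"]),
})
df["Assembly_to_Accident"] = (df["Assembly Date"] - df["Accident Date"]).dt.days

# Count rows with a negative gap; these deserve inspection
# (or flagging) before being fed into plots or models
n_negative = (df["Assembly_to_Accident"] < 0).sum()
print(n_negative)  # 1
```

The same check applies to `Hearing_to_Accident` and `Hearing_to_Assembly`.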
Categorical Data Visualization¶
For categorical data, we plotted a bar chart showing the frequency of each unique value.
This visualization is not ideal for the following attributes, which have a very large number of distinct values:
Zip Code
Carrier Name
[ ] change the visual for these attributes
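One option for these high-cardinality attributes (a sketch on a hypothetical toy series, since `train_data` is not reproduced here) is to keep only the N most frequent categories and collapse the rest into "Other" before plotting:

```python
import pandas as pd

# Hypothetical stand-in for a dense column such as Carrier Name
carriers = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"])

# Keep only the N most frequent categories; everything else is
# binned into "Other" so a bar chart stays readable
top_n = carriers.value_counts().head(2)
collapsed = carriers.where(carriers.isin(top_n.index), "Other")
print(collapsed.value_counts().to_dict())  # {'A': 5, 'B': 3, 'Other': 3}
```

The collapsed series can then be passed to `sns.countplot` as usual.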
1.6 Multivariate Relationships ¶
Gender <> Day of Year
train_data['Day of Year'] = train_data['Accident Date'].dt.dayofyear
# Map Gender to numerical values (e.g., 0 for Male, 1 for Female)
train_data['Gender Numeric'] = train_data['Gender'].map({'M': 0, 'F': 1})
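Rather than correlating an arbitrary 0/1 gender encoding with day-of-year, a per-group summary is often easier to interpret. A minimal sketch on a hypothetical four-row sample:

```python
import pandas as pd

# Hypothetical mini-sample of accident dates and genders
df = pd.DataFrame({
    "Accident Date": pd.to_datetime(
        ["2021-01-15", "2021-07-01", "2021-12-30", "2021-06-10"]
    ),
    "Gender": ["M", "F", "M", "F"],
})
df["Day of Year"] = df["Accident Date"].dt.dayofyear

# Median day-of-year per gender: a simple check for seasonal shifts
medians = df.groupby("Gender")["Day of Year"].median()
print(medians.to_dict())  # {'F': 171.5, 'M': 189.5}
```

On the real data, a KDE of `Day of Year` split by `Gender` would make any seasonal difference visible at a glance.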
Useful Aggregations¶
train_data[train_data['C-3 Date'].isna()]['Claim Injury Type'].value_counts()
Claim Injury Type
2    252951
4     69533
3     45065
5     11065
1      6867
6       859
8       423
7        18
Name: count, dtype: int64
train_data[train_data['IME-4 Count'].isna()]['Claim Injury Type'].value_counts()
Claim Injury Type
2    285575
4     76840
3     55113
1     12115
5     10996
8       369
6       211
7         4
Name: count, dtype: int64
train_data[train_data['C-2 Date'].isna()]['Claim Injury Type'].value_counts()
Claim Injury Type
2    7502
1    6710
4     218
3     108
5      21
7       1
Name: count, dtype: int64
# Get value counts for 'Claim Injury Type' where 'C-2 Date' and 'C-3 Date' are NaN
c2_counts = train_data[train_data['C-2 Date'].isna()]['Claim Injury Type'].value_counts()
c3_counts = train_data[train_data['C-3 Date'].isna()]['Claim Injury Type'].value_counts()
c2_c3 = train_data[(train_data['C-3 Date'].isna()) & (train_data['C-2 Date'].isna())]['Claim Injury Type'].value_counts()
# Total counts for normalization (relative frequency)
total_c2 = c2_counts.sum()
total_c3 = c3_counts.sum()
totalc2_c3 = c2_c3.sum()
# Create a DataFrame combining absolute and relative frequencies
claim_injury_summary = pd.DataFrame({
'without C-2 Form': c2_counts,
'without C-3 Form': c3_counts,
'Without both forms': c2_c3,
'Both forms RELATIVE FREQ': (c2_c3 / totalc2_c3) * 100,
'C-2 Relative Frequency (%)': (c2_counts / total_c2) * 100,
'C-3 Relative Frequency (%)': (c3_counts / total_c3) * 100,
}).fillna(0) # Fill NaN with 0 for categories missing in either column
# Display the DataFrame
claim_injury_summary
| Claim Injury Type | without C-2 Form | without C-3 Form | Without both forms | Both forms Relative Freq (%) | C-2 Relative Frequency (%) | C-3 Relative Frequency (%) |
|---|---|---|---|---|---|---|
| 1 | 6710.0 | 6867 | 2706.0 | 42.573946 | 46.085165 | 1.775423 |
| 2 | 7502.0 | 252951 | 3529.0 | 55.522341 | 51.524725 | 65.399024 |
| 3 | 108.0 | 45065 | 21.0 | 0.330396 | 0.741758 | 11.651296 |
| 4 | 218.0 | 69533 | 99.0 | 1.557583 | 1.497253 | 17.977357 |
| 5 | 21.0 | 11065 | 0.0 | 0.000000 | 0.144231 | 2.860792 |
| 6 | 0.0 | 859 | 0.0 | 0.000000 | 0.000000 | 0.222090 |
| 7 | 1.0 | 18 | 1.0 | 0.015733 | 0.006868 | 0.004654 |
| 8 | 0.0 | 423 | 0.0 | 0.000000 | 0.000000 | 0.109364 |
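The table suggests that the missingness pattern of the C-2 and C-3 forms is itself associated with the claim injury type, so it may be worth encoding as features. A minimal sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical mini-sample with missing form dates
df = pd.DataFrame({
    "C-2 Date": pd.to_datetime(["2021-01-01", None, None]),
    "C-3 Date": pd.to_datetime([None, "2021-02-01", None]),
})

# Binary indicators capture the missingness pattern itself,
# independently of any later imputation of the dates
df["C-2 Missing"] = df["C-2 Date"].isna().astype(int)
df["C-3 Missing"] = df["C-3 Date"].isna().astype(int)
print(df[["C-2 Missing", "C-3 Missing"]].values.tolist())  # [[0, 1], [1, 0], [1, 1]]
```

These indicators survive imputation, so the predictive signal in the table above is not lost during pre-processing.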
Multivariate Relationships ¶
# Mean days from accident to assembly for each industry code
# (uses the 'Assembly_to_Accident' column created earlier)
mean_days_by_industry = train_data.groupby('Industry Code')['Assembly_to_Accident'].mean()
# Plot the bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=mean_days_by_industry.index, y=mean_days_by_industry.values, palette='viridis')
# Add labels and title
plt.xlabel('Industry Code')
plt.ylabel('Average Days from Accident to Assembly')
plt.title('Average Days from Accident to Assembly by Industry Code')
plt.xticks(rotation=45)
plt.show()
# Mean days from accident to assembly for each cause of injury code
# (uses the 'Assembly_to_Accident' column created earlier)
mean_days_by_injury = train_data.groupby('WCIO Cause of Injury Code')['Assembly_to_Accident'].mean()
# Plot the bar plot
plt.figure(figsize=(16, 6))
sns.barplot(x=mean_days_by_injury.index, y=mean_days_by_injury.values, palette='viridis')
# Add labels and title
plt.xlabel('Cause of Injury Code')
plt.ylabel('Average Days from Accident to Assembly')
plt.title('Average Days from Accident to Assembly by Cause of Injury Code')
plt.xticks(rotation=50)
plt.show()
# Mean days from accident to assembly for each nature of injury code
# (uses the 'Assembly_to_Accident' column created earlier)
mean_days_by_injury = train_data.groupby('WCIO Nature of Injury Code')['Assembly_to_Accident'].mean()
# Plot the bar plot
plt.figure(figsize=(16, 6))
sns.barplot(x=mean_days_by_injury.index, y=mean_days_by_injury.values, palette='viridis')
# Add labels and title
plt.xlabel('Nature of Injury Code')
plt.ylabel('Average Days from Accident to Assembly')
plt.title('Average Days from Accident to Assembly by Nature of Injury Code')
plt.xticks(rotation=50)
plt.show()
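The three cells above repeat the same groupby-and-plot pattern with only the grouping column changed. A small helper (a sketch on a hypothetical mini-frame; the column names mirror those used above) removes the copy-paste risk that caused the earlier mix-up of `mean_days_by_industry` and `mean_days_by_injury`:

```python
import pandas as pd

# Hypothetical mini-sample
df = pd.DataFrame({
    "Industry Code": [10, 10, 20],
    "WCIO Cause of Injury Code": [1, 2, 1],
    "Assembly_to_Accident": [5, 15, 30],
})

def mean_days_by(frame, group_col, value_col="Assembly_to_Accident"):
    """Mean of `value_col` per category of `group_col`, sorted descending."""
    return frame.groupby(group_col)[value_col].mean().sort_values(ascending=False)

by_industry = mean_days_by(df, "Industry Code")
print(by_industry.to_dict())  # {20: 30.0, 10: 10.0}
```

Each plot then becomes a single call such as `mean_days_by(train_data, 'Industry Code')` followed by the shared bar-plot code.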
Conditional Mode - by Industry Code¶
# Prepare the data for a stacked bar chart
cause_industry_counts = train_data.groupby(['WCIO Cause of Injury Code', 'Industry Code']).size().unstack(fill_value=0)
# Plot stacked bar chart
cause_industry_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Distribution of WCIO Cause of Injury Code by Industry Code', fontsize=14)
plt.xlabel('WCIO Cause of Injury Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Industry Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
# Prepare the data for a stacked bar chart
industry_cause_counts = train_data.groupby(['Industry Code', 'WCIO Cause of Injury Code']).size().unstack(fill_value=0)
# Plot stacked bar chart
industry_cause_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Distribution of Industry Code by WCIO Cause of Injury Code', fontsize=14)
plt.xlabel('Industry Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='WCIO Cause of Injury Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Prepare the data for a stacked bar chart
body_industry_counts = train_data.groupby(['WCIO Part Of Body Code', 'Industry Code']).size().unstack(fill_value=0)
# Plot stacked bar chart
body_industry_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='plasma')
plt.title('Distribution of WCIO Part Of Body Code by Industry Code', fontsize=14)
plt.xlabel('WCIO Part Of Body Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Industry Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
# Prepare the data for a stacked bar chart
industry_body_counts = train_data.groupby(['Industry Code', 'WCIO Part Of Body Code']).size().unstack(fill_value=0)
# Plot stacked bar chart
industry_body_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='plasma')
plt.title('Distribution of Industry Code by WCIO Part Of Body Code', fontsize=14)
plt.xlabel('Industry Code', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='WCIO Part Of Body Code', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
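Because industries differ greatly in size, absolute stacked counts can hide compositional differences. Normalizing each row to shares (a sketch on a hypothetical four-row sample) makes the industries directly comparable:

```python
import pandas as pd

# Hypothetical mini-sample of paired codes
df = pd.DataFrame({
    "Industry Code": [10, 10, 10, 20],
    "WCIO Part Of Body Code": [1, 1, 2, 2],
})

# normalize='index' turns raw counts into row-wise shares, so
# industries of very different sizes become directly comparable
shares = pd.crosstab(
    df["Industry Code"], df["WCIO Part Of Body Code"], normalize="index"
)
print(shares.loc[10].round(2).tolist())  # [0.67, 0.33]
```

The resulting table can be plotted with the same `plot(kind='bar', stacked=True)` call used above, giving a relative-frequency counterpart to each absolute chart.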